Comparative study on chloroplast genomes of three Hansenia forbesii varieties (Apiaceae)

To find the gene hypervariable regions of three varieties of Hansenia forbesii H. Boissieu and determine their phylogenetic relationship, the chloroplast (cp) genome of these three varieties were firstly sequencing by the Illumina hiseq platform. In this study, we assembled the complete cp genome sequences of Hansenia forbesii LQ (156,954 bp), H. forbesii QX (157,181 bp), H. forbesii WQ (156,975 bp). They all contained 84 protein-coding genes, 37 tRNAs, and 8 rRNAs. The hypervariable regions between three cp genomes were atpF-atpH, petD, and rps15-ycf1. Phylogenetic analysis showed that H. forbesii LQ and H. forbesii WQ were closely related, followed by H. forbesii QX. This study showed that the three varieties of H. forbesii could be identified by the complete cp genome and specific DNA barcode (trnC-GCA-petN) and provided a new idea for germplasm identification of similar cultivated varieties.


Introduction
Hansenia forbesii H.Boissieu is a rare and endangered perennial herb belonging to the genus Hansenia (Apiaceae), and the Chinese name is "Kuan-ye-qiang-huo". The main chemical components isolated from H. forbesii are volatile oils, terpenoids, coumarins, sugars and glycosides, alkynes, phenolic acids, steroids, flavonoids, and alkaloids. It has anti-inflammatory, antibacterial, antioxidant, antiviral, anticancer, antipyretic, and analgesic activities, and has a significant impact on the cardiovascular and cerebrovascular, digestive, respiratory, and central nervous systems [1][2][3][4]. In the traditional production areas of Hansenia, the loss of germplasm resources is an extremely serious problem. As the problem of artificial breeding has not been solved, it is difficult to collect enough seeds, roots, and other reproductive materials in the genuine production areas and even enough genetic standard materials, showing that the severity of the risk of endangerment of Hansenia cannot be ignored. The wild-growing to domestic cultivation for Hansenia and the establishment of a standardized production base are the most effective ways to ensure the quality of medicinal materials, meet the medicinal requirements, protect wild resources and achieve sustainable development. Due to the wide distribution range and low altitude of H. forbesii, the requirements for environmental and soil conditions are relatively loose, the root system is well developed and the medicinal value is the same as that of H. weberbaueriana. Therefore, vigorously promoting the artificial cultivation of H. forbesii can reduce the pressure on this resource. a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 The main distribution types of H. forbesii are purple stem type, green stem type, large leaf type, small leaf type, root single branch type, root dispersed type, and other mixed species. The excellent varieties of H. forbesii have not been popularised and applied in production and cultivation, which cannot meet the needs of its industrial development. Moreover, there is a serious problem of mixed varieties in field production, which has a significant impact on its high-yield cultivation and is the bottleneck for the development of the H. forbesii industry. Before this study, we tried to address these issues by the following actions. Taking the wild H. forbesii (Wei-Qiang) resources in Gansu Province as the original materials, using the single plant optimization method and taking the plant agronomic characters, susceptibility, and drug grade as the indicators, we have selected two new varieties of H. forbesii (Long-Qiang and Qiang-Xuan) with high quality and high yield through line identification, line comparison, multi-point test, and production test. But it is not clear what kind of variation exists at the gene level and what is their phylogenetic relationship.
Chloroplast is an essential organelle in plant cells and it plays an important role in photosynthesis, carbon sequestration, and amino acid synthesis. Its complete genome contains a large amount of genetic information and its structure is highly conserved, including gene content, gene organization, and intron content within genes [5]. Chloroplast (cp) genome has also undergone intron loss, gene loss, gene duplication, gene rearrangement, pseudogenization, as well as uneven contraction and expansion of IRs regions. These genomic events can in turn become synapomorphies for entire clades [6]. In addition, many mutational events take place within the chloroplast genome including substitutions, insertions-deletions, inversions, microstructural changes, and oligonucleotide repeats [7][8][9]. The cp genome is slow evolving [10] and inherited from a single parent: maternally in most angiosperms [11] and paternally in some gymnosperms [12]. It is widely used in the study of plant molecular evolution and phylogeny, as well as in genetic transformation, genetic engineering, and molecular breeding [13][14][15]. Recently, with the gradual popularization and low cost of high-throughput sequencing technology, some scientists have proposed that cp genome data can be used for species identification and interspecific differentiation research [16][17][18][19][20]. There is no significant difference in some appearance indicators among the three H. forbesii. Through comparative analysis of their cp genome, the differences between them can be identified at the genetic level, and polymorphism markers can be developed to achieve variety identification. This is necessary for solving the problem of mixed varieties of H. forbesii, and has good significance for promoting excellent variety.
To find the gene hypervariable regions of three varieties of H. forbesii and determine their phylogenetic relationship, we used Illumina hiseq technology to sequence the cp genome. And we compared the characteristics of the cp genome, screened the different nucleotide sequences, and deciphered the genetic relationship to provide the basis for the differentiation of target varieties with excellent and optimal shapes.

Materials and pretreatment
The fresh, tender, and healthy leaves of three H. forbesii were collected from the germplasm resource garden in Taoyang Town, Lintao County, Gansu Province in China on 30 September 2021. Professor Zhirong Sun, a scientist in the national traditional Chinese medicine system, identified that all the samples were Hansenia forbesii H.Boissieu (Apiaceae). The plant materials produced and used in this study comply with China guidelines and legislation. All the experiments were carried out to national and international guidelines. The aboveground forms of three H. forbesii, are shown in Fig 1 and described in S1 Table, which are named QX, LQ, and WQ according to their morphological characteristics and breeding process. After cutting off the veins of leaves with clean scissors, we put the remaining leaves into molecular bags, added silica gel, and saved them. The voucher specimens (QH210930) of all materials were kept in the herbal medicine library of the School of Chinese Materia Medica, Beijing University of Chinese Medicine.

Total DNA extraction and library preparation
The total DNA was extracted using a small plant DNA Extraction Kit (Omega Bio-Tek, USA). Then we randomly broke qualified DNA samples using an ultrasonic crusher (Covaris, USA), and then completed the whole library preparation through the steps of end repair, adding Atailing, adding sequencing connector, purification, and PCR amplification [21]. The qualified library was sequenced using the Illumina high-throughput sequencing platform (Illumina, NovaSeq 6000, USA).

Chloroplast genome assembly and annotation
The high-quality reads were assembled using Get Organelle v.4.0 and then annotated by CpGAVAS2 [22]. The annotations of tRNA genes were confirmed by using tRNAscan-SE v.2.03 [23].
The cp genome of LQ, QX, and WQ were submitted to GenBank at the National Center of Biotechnology Information (NCBI), and the accession numbers were ON 208134, ON208135, and ON208136, respectively. Fully annotated plastome circular diagrams were drawn by Orga-nellarGenomeDRAW (OGDRAW) [24,25].

Repeat analysis and comparative analyses
Repetitive sequence analyses were performed using CPGAVAS2 analysis. Tandem repeats were identified using default settings by Tandem Repeats Finder [28]. The Misa. pl was used to screen the simple sequence repeats (SSRs) [29]. The scattered repetitive sequences were found by using VMATCH. The REPuter was used to determine the size and location of the oligonucleotide repeats (ORs) [30]. The complete cp genome of three cultivated varieties was compared by mVISTA [31], and the genome of H. forbesii (NC035054) was used as the reference sequence for annotation. Sliding window analysis was conducted to assess the nucleotide diversity (Pi) values of the cp genome by DnaSP v6 (window length = 300 bp, step size = 25 bp). IRscope [32] was used to analyze inverted repeated traction and expansion at cp genome junctions.

Identification and validation of barcode for varieties discrimination
According to the results of DNAsp, we chose the high variation region to distinguish the three varieties. Primers to discriminate between the three varieties under study were designed on the variable intergenic regions using Snapgene 6.2.1 (Snapgene from Insightful Science, available at http://www.snapgene.com, last used in 2023). PCR amplifications were performed in a final volume of 20 μL with 10 μL 2×Taq PCR Master Mix, 0.5 μM of each primer, 5 μL template DNA, and 4 μL ddH 2 O following the manufacturer's instructions (Mei5 Biotechnology, Co., Ltd). All amplifications were carried out in a Pro-Flex PCR system (Applied Biosystems, Waltham, MA, USA) under the following conditions: denaturation at 95˚C for 3 min, followed by 36 cycles of 94˚C for 25 s and 55˚C for 10 s, and 72˚C for 2 min as the final extension following the manufacturer's instructions (Mei5 Biotechnology, Co., Ltd). PCR amplicons were visualized on 1% agarose gels, purified, and then subjected to bidirectional Sanger sequencing on an ABI 3730 XL instrument (Applied Biosystems, USA) using the same set of primers used for PCR amplification with BigDye v3.1 chemistry (Applied Biosystems) following manufacturer's instructions. All amplifications were repeated three for each variety.

Phylogenetic analysis
Phylogenetic analysis was performed based on 37 complete cp genomes, including the three assembled sequences in our study, 34 cp genomes (30 Genus, 2 Family) downloaded from the NCBI and Cornus officinalis (NC042746) as outgroup. The shared protein-coding genes were extracted using Phylosuite v.1.2.2 software, and the MAFFT v.7.307 was used for alignment [33]. Subsequently, the alignment was conducted using the GTR+I+Gevolution model based on Bayesian inference (BI) in MrBayes [34]. The parameter was set to run for five million generations and sampled every 1,000 generations, with all other settings left at their defaults, and the first 25% of each run was discarded as burn-in. The alignment was also evaluated using bootstrap analysis on 1000 in a maximum likelihood (ML) by RAxML [35], with parameters: raxmlHPC-PTHREADS-SSE3 -fa -N 1000-m GTRGAMMA -x 551,314,260 -p 551,314,260 -o Cornus_officinalis _NC_042746-T 20, 1000 replications and best-fit model selection.

Chloroplast genome information
The basic information of the cp genome of three H. forbesii is shown in Table 1 and compared with other species of the Hansenia genus and Arabidopsis thaliana (S2 Table). The length of the cp genome of H. forbesii QX was the longest in three varieties (157,181 bp), the total GC content of three cp genomes was about 37.6%, and they all had a typical four-region structure, in which the LSC (large single-copy region) was 86,167 bp (LQ), 86,361 bp (QX), and 86,188 bp (WQ), the SSC (small single-copy region) was 17,755 bp (LQ), 17,786 bp (QX) and 17,755 bp (WQ) and the IR (inverted repeats, including IRa and IRb) was 26,516 bp (LQ), 26,517 bp (QX), and 26,516 bp (WQ) (Fig 2). A total of 129 genes were found in these varieties, including 84 protein-coding genes, 37 tRNA genes, and 8 rRNA genes ( Table 2). The number and types of introns were similar among the three H. forbesii. The four types of cp genes contained 45 photosynthesis-related genes, 29 self-replication-related genes, six other genes, and four unknown genes. Amongst them, ndhB, rpl2, rpl23, rps12, rps7, and ycf2 appeared twice. Eleven genes each contained one intron, including rpl2 (×2) and ndhB (×2), which were located in the IR, and the genes (rps16, atpF, rpoC1, petB, petD, and rpl16) were located in the LSC, and the ndhA was the only present in the SSC region. In addition, the ycf3 and clpP comprise two introns.

Comparison of chloroplast introns and exons
By analyzing the cp genome annotation files of the three varieties of H. forbesii, we found that the introns and exons of 21 genes in the cp genome of the three varieties were different. QX had one more bp than LQ and WQ in terms of the introns in the trnK-UUU, rps16, atpF, ycf3, clpP, and rpl16, but one less bp in rpoC1 (S3 Table). Studies have shown that introns can reduce gene splicing and translation, slow down cell metabolism and reduce energy consumption [36] and the ClpP gene is also related to plant disease resistance, which may be the reason why QX has better disease resistance than LQ and WQ.

Codon usage
The cp genome of three H. forbesii contained 64 codons encoding 20 amino acids. The result of the RSCU revealed that 32 codons were used frequently in these cultivars, with the highest frequency of UUA followed by AGA (Fig 3). Moreover, the codon exhibited a strong bias toward an A or T at the third position. The codons that contain A/T at the 3' end mostly have RSCU �1, whereas the codons are having C or G at the 3' end mostly have RSCU �1. Amino acid frequency analyses revealed the highest frequency of Leucine, whereas Tryptophane was a rare amino acid. In general, we found high similarities in codon usage and amino acid frequency among the three cultivated varieties, and both contain high AT content.

Repeat analysis
Our analyses identified SSRs per genome composed of mono-to di-nucleotide repeating units (Fig 4a). The number and type of SSRs in LQ and WQ were similar, with 34 single nucleotide repeats and 2 dinucleotide repeats. QX contains 39 single nucleotide repeats more than LQ and WQ. Moreover, the main type of mononucleotide repeats was T. Oligonucleotide repeats in three varieties.  30 to 40 bp. This result showed that QX was more different than LQ and WQ. We also evaluated the number of repeats about the species' phylogenetic position using the topology in Fig 9.

Inverted repeats contraction and expansion
The inverted repeats contraction and expansion revealed variation at LSC/IRs/SSC junctions (Fig 5). In three varieties, a truncated copy of the rps19 gene was found at the IRb/LSC junction; the rpl22 gene was found entirely in the LSC region; and the rpl2 gene was found entirely in the IRb region. Another truncated copy of the ndhF gene was found at the junction of IRb/SSC in all species, which starts in IRb regions and integrates into the SSC region. Moreover, a truncated copy of ycf1 was found in the SSC/IRa junction, which was longer in IRa of H. forbesii. In three varieties, trnH completely exists in LSC and only 3 bp from the junction of IRa/LSC. These results showed that the chloroplast genome of these three varieties did not expand or contract.

Comparative cp genomic analysis
The cp genome of the three varieties was compared by mVISTA, and the H. forbesii (NC035054) was used as the reference sequence for annotation. The three varieties exhibited similar variation sites and degrees of variation (Fig 6). The coding regions (CDS) were more conserved than the intergenic spacers (IGS). The high divergence area in IGS was found in petN-psbM, trnY(GUA)-trnT(GGU), ndhC-trnV(UAC), petA-psbJ, rps8-rpl14, rrn4.5-rrn5. Furthermore, some mutations of CDS were found in rpoC1 and rpl2. Moreover, the result showed that IR regions had lower sequence divergence than LSC and SSC regions.   We used the DnaSP software to compare the nucleotide variation values (Pi) between the whole cp genome of the three varieties. The hypervariable regions were detected, and the sequence differences were analyzed. Sliding window analysis revealed that the nucleotide diversity values varied from 0 to 0.008. Three mutational hotspots in the LSC and SSC regions were identified, including atpF-atpH, petD, and rps15-ycf1 (Fig 7). These can be used as potential sites for studying population genetics and the identification of these three varieties.

Specific DNA barcode maker design for three varieties
To discriminate the three varieties, we selected two hypervariable regions, trnC-GCA-petN (29,893 bp) and trnQ-UUG-psbK (7,894-8,266 bp), to develop specific DNA barcode. The related primer sequence is shown in S4 Table. PCR amplification of total DNA from three varieties resulted in products having the expected size (S1 Fig). In terms of the amplification of primer 2, QX exhibited more polymorphic loci compared to LQ and WQ. Under the amplification of primer 1, all three varieties showed a single bright band. The DNA fragments were extracted from each band and then subjected to Sanger sequencing. The sequencing results were identical to the expected sequences (Fig 8). The barcode has two specific Indel loci. These two variable loci can be used to differentiate three varieties.

Phylogenetic analysis
The phylogenetic trees (ML and BI) were constructed by the shared protein-coding genes to investigate the phylogeny of the three H. forbesii, and C. officinalis (NC042746) was the outgroup. Most nodes were supported with maximum support (100% bootstrap support) on the consensus trees, and the topological structure of the two phylogenetic trees had perfect consistency (Fig 9). Three varieties of H. forbesii clustered with Hansenia forbesii (NC035054) and then merged with H. forrestii (NC035056) and H. oviformis (NC035055), indicating that LQ and WQ were closely related, followed by QX, and they were closely related to Hansenia forbesii (NC035054). Finally, they became a clade with Changium smyrnioides (NC053938), Tongoloa silaifolia (NC049062), H. weberbaueriana (NC035053), Haplosphaera himalayensis (NC056096). This clade had a close genetic relationship, in which QX was relatively distant from LQ and WQ.

Discussion and conclusion
This study found that the three varieties of Hansenia forbesii had similar genetic relationships. Only the length of introns of some H. forbesii QX genes was longer than H. forbesii LQ and H. forbesii WQ. We found three hypervariable regions between the cp genome of the three varieties of H. forbesii, including atpF-atpH, petD, and rps15-ycf1. So the character differences between the three varieties of H. forbesii could be judged by the complete cp genome. Studies have shown that introns can reduce gene splicing and translation, slow down cell metabolism and reduce energy consumption [36]. QX had one more bp than LQ and WQ in terms of the introns in the trnK-UUU, rps16, atpF, ycf3, clpP, and rpl16, but one less bp in rpoC1. Taking Caseinolytic protease (ClpP) as an example, an ATP-dependent and highly conserved serine protease [37], it is essential for the regulation of the cellular stress response, cell homeostasis, and bacterial virulence [38,39]. When encoding ClpP, the increase in introns length may affect gene splicing and translation, thereby affecting the expression of ClpP in QX, resulting in better disease resistance. Amino acid frequency analyses revealed the highest frequency of Leucine, Leucine in plants can enhance resistance to salt stress, enhance pollen vitality and germination, and enhance fragrance. The highest frequency of Leucine may also be one of the reasons why the three varieties have aromatic aromas.
The IR regions exhibited lower sequence divergence than LSC and SSC regions and were consistent with the previous comparisons of the cp genome [40]. Intergenic spacer regions are the most frequently used cp markers for phylogenetic studies at lower taxonomic levels in plants [41], as they are regarded as more variable and could provide more phylogenetically informative characters. In a previous study, rps15-ycf1 was identified as a mutational hotspot for identifying Isodon rubescens [18]. The PetD located in the LSC region, as a protein-coding gene, contains an intron, which is consistent with previous genetic studies [40] and can also be used as a highly variable region for species identification. Phylogenetic analysis further showed that H. forbesii LQ and H. forbesii WQ were closely related, followed by H. forbesii QX, and they were closely related to Hansenia forbesii (NC035054), this also indirectly indicated the reliability of the data of three varieties of H. forbesii.
The Cp genome has a moderate rate of nucleotide evolution, which results in its suitability for species identification and phylogenetic studies at different taxonomic levels [42,43]. Chen et al. identified the genus Clerodendrum with no significant difference in external morphology using the cp genome, they found two intergenic regions: trnH-GUG-psbA, nhdD-psaC, exhibiting a high degree of variations and showed that the complete cp genome can be used as a super-barcode to identify these Clerodendrum species [44]. Wu et al. compared five Crataegus cp genomes and found five highly variable regions, namely petG-trnW-CCA, trnH-GUG-psbA, trnR-UCU-atpA, petA-psbJ, and trnT-GGU-psbD, indicating that the complete cp genome could also be used as a super-barcode to accurately authenticate the five Crataegus species [45]. Fu et al. obtained 48 cp genome from 16 Taxus species and Pseudotaxus chienii and constructed the ML evolutionary tree, which showed that the cp genome can successfully distinguish all Taxus species and can be used as a super barcode to identify species [46]. Lan et al. analyzed five cp genomes from Artemisia species, and showed that the cp genome can provide distinguishing features to help identify closely related Artemisia species and had the potential to serve as a universal super barcode for plant identification [20]. Most studies on the application of the cp genome have pointed out that a complete cp genome can be used as super barcodes for species identification. We found that the internal reason for the difference in the external phenotypes of the three varieties of H. forbesii can be reflected in the highly variable region of the cp genome, so the complete cp genome of three varieties of H. forbesii can be used as the DNA super barcode for identifying these three varieties. And trnC-GCA-petN can be used as a specific DNA barcode to effectively identify the three varieties. This assists in solving the problem of mixed varieties in the actual production of H. forbesii, and can also be taken into account by other plants, which has great significance. In addition, the genetic relationship of the H. forbesii QX variety is far away from the other two varieties, so it can be selected as the first-generation variety for single plant optimization and pure breeding to ensure its excellent phenotypes, and it can be popularized and applied on a large area. This study can provide a reference for the sustainable utilization of Hansenia forbesii resources and the identification of different varieties with different phenotypes of the same species.
Supporting information S1 Table. The